Abstract

We explore some aspects of the relationship between biological evolution
processes and the mathematical theory of records. For Eigen's quasispecies
model with an uncorrelated fitness landscape, we show that the evolutionary
trajectories traced out by a population initially localized at a randomly
chosen point in sequence space can be described in close analogy to record
dynamics, with two complications. First, the increasing number of genotypes
that become available with increasing distance from the starting point implies
that fitness records are more frequent than for the standard case of
independent, identically distributed random variables. Second, fitness records
can be bypassed, which strongly reduces the number of genotypes that take part
in an evolutionary trajectory. For exponential and Gaussian fitness
distributions, this number scales with sequence length $N$ as $\sqrt{N}$, and
it is of order unity for distributions with a power law tail. This is in strong
contrast to the number of records, which is of order $N$ for any fitness
distribution.
extremal statistics, quasispecies model



1 Records and evolution 



In the Darwinian view of nature [1], biological evolution is a fierce
competition among different organisms in which the winners are rewarded by
copious offspring while the losers perish. It should therefore be no surprise
to see metaphors from the world of athletics turn up in the description of
evolutionary dynamics. Indeed, every evolutionary innovation that is fixed in a
population has to be a record, in the sense that it solves some problem
encountered by the organism in a way that is superior to all existing
solutions. A possible mathematical relationship between evolution models and
the theory of records was suggested by Kauffman and Levin in the context of
long-jump adaptation on correlated fitness landscapes [2], and has more
recently been elaborated by Sibani and coworkers [3].
The basic problem of record statistics can be formulated as follows [4,5]:
Given an ordered sequence $\{X_n\}_{n=1,2,3,\ldots}$ of real random variables
(RV's), a record occurs at $r$ iff

$$X_r = \max_{n \le r} \{X_n\}. \tag{1}$$



By convention, $X_1$ is always a record, and through application of (1) a
series of record times $\{r_k\}_{k=1,2,3,\ldots}$ and record values
$\{X_{r_k}\}$ is generated from the underlying sequence $\{X_n\}$, with $r_k$
denoting the time of the $k$'th record and $r_1 = 1$. Many properties of
records are known for the case when the $X_n$ are independent and identically
distributed (i.i.d.). In particular, the statistics of the record times is
completely independent of the underlying probability distribution. This is
largely a consequence of a simple symmetry argument [6]. Denote by $P_n$ the
probability that a record occurs at $n$. In the i.i.d. case, each of the $n$
RV's $\{X_1, X_2, \ldots, X_n\}$ is equally likely to be the largest, and hence
$P_n = 1/n$. In particular, the expected number of records
$\langle R(n) \rangle$ up to time $n$ is equal to

$$\langle R(n) \rangle = \sum_{i=1}^{n} \frac{1}{i} \approx \ln(n) + 0.57721566\ldots + O(1/n). \tag{2}$$

The full distribution of $R(n)$ becomes Poissonian for large $n$, and the
record sequence can be described as a Poisson process in logarithmic time
$\ln(n)$ [3]. Furthermore, it can be shown that the ratios of subsequent record
times $r_k/r_{k+1}$ become uniformly distributed, independent random variables
for large $k$ [4]. This implies that the sequence $\{r_k\}$ of record times has
some rather counterintuitive properties; for example, given the time $r_k$ of
the $k$'th record, the expected time of the preceding record is
$\langle r_{k-1} \rangle = r_k/2$, while the expected time
$\langle r_{k+1} \rangle$ of the next record is infinite$^1$.

The record sequence is distinctly non-stationary: With increasing time, it
becomes exponentially harder to beat the current record. For this reason record
dynamics and the associated log-Poisson process have been invoked to describe
the nonstationary aspects of macroevolutionary dynamics [3] (as evidenced e.g.
by extinction and origination rates of taxa in the fossil record [7]), as well
as the relaxation of disordered systems such as spin glasses [8]. The pattern
of static periods of exponentially increasing duration interspersed by rare
events of rapid change (new records) is a simple realization of punctuated
equilibrium, an important paradigm of evolutionary theory [9,10].

1 The latter property invalidates an argument based on the average waiting
time for the next record, which has led Kauffman and Levin to conclude
(erroneously) that the number of records grows as $\log_2(n)$ rather than as
$\ln(n)$ [2].
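The distribution independence expressed by (2) can be checked numerically. The
following is a minimal sketch using only the Python standard library; the
function names `count_records`, `mean_records` and `harmonic` are illustrative
choices and do not come from the cited references. It counts records in i.i.d.
sequences drawn from two different distributions and compares the sample mean
with the harmonic sum in (2).

```python
import math
import random


def count_records(values):
    """Count entries that strictly exceed all earlier entries (first entry counts)."""
    best = -math.inf
    n_records = 0
    for x in values:
        if x > best:
            best = x
            n_records += 1
    return n_records


def mean_records(n, trials, sampler, seed=0):
    """Average number of records in `trials` i.i.d. sequences of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += count_records(sampler(rng) for _ in range(n))
    return total / trials


def harmonic(n):
    """Exact expected number of records up to time n, the sum in eq. (2)."""
    return sum(1.0 / i for i in range(1, n + 1))


if __name__ == "__main__":
    n, trials = 200, 500
    print("harmonic sum :", harmonic(n))
    print("uniform      :", mean_records(n, trials, lambda rng: rng.random()))
    print("exponential  :", mean_records(n, trials, lambda rng: rng.expovariate(1.0)))
```

Both sample means should agree with the harmonic sum within statistical error,
which illustrates that only the ordering of the random variables matters.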



Here we approach the relation between evolution and records from the point 
of view of population dynamics on the space of genetic sequences [11,12]. 
We show how the properties of sequence space introduce modifications to the 
standard record problem, which are of interest in their own right, and only 
partly understood at present. Some basic notions are introduced in the next 
section, and the remaining sections summarize the main results of a detailed 
investigation, which will be published elsewhere [13]. 



2 Sequence space and fitness landscape 

The proper arena in which to describe evolutionary dynamics is the space of
genotypes, which are represented as sequences
$\sigma = (\sigma_1, \sigma_2, \ldots, \sigma_N)$ of $N$ symbols taken from an
alphabet of $\ell$ letters; for DNA sequences $\ell = 4$, but in many
theoretical studies binary sequences ($\ell = 2$) are considered for
simplicity. The total number of possible sequences is $S = \ell^N$. The nearest
neighbors of a given sequence $\sigma$ are those sequences $\sigma'$ that can
be reached from $\sigma$ by a single point mutation, which alters one of the
$N$ symbols. More generally, the Hamming distance $d(\sigma, \sigma')$ between
two sequences $\sigma$ and $\sigma'$ is the number of symbols in which the two
differ. An important quantity in what follows is the number $\alpha_k$ of
sequences at distance $k$ from a given sequence, which takes the form

$$\alpha_k = \binom{N}{k} (\ell - 1)^k. \tag{3}$$

This can be derived by noting that there are $\binom{N}{k}$ ways of choosing
$k$ mutation sites on the sequence, and at each site $\ell - 1$ different
symbols are available. The maximum distance between two sequences is $N$. For
large $N$, (3) takes the form of a Gaussian of width $\sqrt{N}$ centered around
the distance $k_{\max} = N(\ell - 1)/\ell$ at which the majority of sequences
reside.
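As a quick consistency check of (3), the shell sizes can be evaluated exactly
with integer arithmetic. The sketch below is illustrative only; the function
name `shell_sizes` and the parameter name `ell` for the alphabet size are
assumptions made here. It verifies that the shells add up to the total number
of sequences and locates the most populated shell.

```python
from math import comb


def shell_sizes(N, ell):
    """Number of sequences at Hamming distance k = 0..N from a given sequence, eq. (3)."""
    return [comb(N, k) * (ell - 1) ** k for k in range(N + 1)]


if __name__ == "__main__":
    N, ell = 20, 4
    alpha = shell_sizes(N, ell)
    # All shells together exhaust sequence space.
    print(sum(alpha) == ell ** N)
    # The most populated shell sits at N * (ell - 1) / ell.
    k_peak = max(range(N + 1), key=alpha.__getitem__)
    print(k_peak, N * (ell - 1) / ell)
```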

Next we have to associate a fitness with each sequence $\sigma$. We define the
fitness $W(\sigma)$, in the Wrightian sense [14], to be proportional to the
expected number of offspring of an individual carrying the genotype $\sigma$
[15,16]. This implies that $W(\sigma) > 0$, and only ratios of fitnesses
matter. We can thus write $W(\sigma) = e^{\beta F(\sigma)}$ and introduce an
inverse selective temperature $\beta$ [16] for later use. In the following both
$W$ and $F$ will be referred to as "fitness".

The mapping from genotype to fitness is largely unknown, but it is expected to
be very complicated. We therefore follow a common practice and assume the
$F(\sigma)$ to be quenched i.i.d. RV's drawn from some distribution $p(F)$; in
statistical physics this is known as the random energy model (REM) of spin
glasses [15,17], while in the context of evolutionary biology it has been
referred to as the house of cards model [14] or the uncorrelated fitness
landscape [2]. Many properties of the REM fitness landscape, such as the number
of local fitness maxima and the length of uphill adaptive walks [2,18], can be
derived using simple ideas from order statistics [19]. It is of particular
interest to find properties that are independent of the fitness distribution.
For example, the probability that a given sequence is a local maximum is equal
to the probability that it has the largest fitness in the set of sequences
comprising its $(\ell - 1)N$ nearest neighbors and itself; by the symmetry
argument of Sect. 1, this is just

$$[(\ell - 1)N + 1]^{-1}.$$

Important characteristics of the REM landscape needed in the following
discussion are the expected maximum fitness value $F_{\max}(S)$ that occurs
among the $S$ independent sequences, and the fitness gap $\epsilon$, which is
the difference between the largest and the second largest fitness value [11]. A
simple estimate for the maximum fitness is obtained by setting the cumulative
fitness distribution $p_c(F)$ equal to $1 - 1/S$ [20],

$$p_c(F_{\max}) = \int_{-\infty}^{F_{\max}} p(F)\, dF = 1 - \frac{1}{S}, \tag{4}$$

and the fitness gap is of the order of
$\epsilon \sim [S\, p(F_{\max}(S))]^{-1}$ [11,12].



3 Records in sequence space

Kauffman and Levin [2] found record statistics to be applicable in a situation
where a population, assumed to be localized at a single sequence at all times,
explores sequence space by random mutations of arbitrarily long range, and
moves to a new location whenever the fitness of the mutant exceeds that of the
present position. To highlight the role of the geometry of sequence space, we
consider here a variant of their model where the range of mutations is
restricted but grows in the course of time. At time $t = 0$ the entire
population resides at a randomly chosen "seed" sequence $\sigma_0$. At the
integer time $t > 0$, the population has access to all genotypes within Hamming
distance $k = t$ of $\sigma_0$, and it always resides in its entirety at the
sequence of maximum fitness within the accessible region. Thus the current
position of the population in sequence space jumps whenever a fitness record
occurs among the $\alpha_k$ sequences which become newly available at time
$t = k$.

The analysis of this model requires a slight generalization of the basic
symmetry argument of record statistics outlined above, which is adapted to a
situation where a variable number of new i.i.d. RV's is introduced at each time
step$^2$ [5,21]: As the newly introduced RV's are indistinguishable from those
that have appeared at earlier times, the probability that a record occurs among
them is simply equal to

$$P_k = \frac{\alpha_k}{\sum_{j=0}^{k} \alpha_j} \approx 1 - \frac{k}{(\ell - 1)(N - k)}. \tag{5}$$



In the last step the expression (3) has been inserted and an expansion for
$k, N \to \infty$ at fixed $k/N$ has been carried out [13]. The probability
starts out at unity and dwindles to zero as $k$ approaches the value $k_{\max}$
of the Hamming distance at which the majority of sequences reside; the process
stops at $t = k_{\max}$, when the globally fittest sequence (which is located
with certainty at $k_{\max}$ for large $N$) is reached. In contrast to the
logarithmic increase (2) in the i.i.d. case, here new records are found quite
frequently, at least when $k \ll N$. This is because of the exponential growth
of the number of available sequences with increasing distance from the seed,
which compensates the scarcity of new records.

Integrating (5) from $k = 0$ to $k_{\max}$ one finds that the mean of the total
number of records $R$ that are encountered during the evolution is given by

$$\langle R \rangle = \left( 1 - \frac{\ln \ell}{\ell - 1} \right) N. \tag{6}$$



It can be shown that the occurrences of records are independent events in this
model [5,13], and hence the variance and higher moments of $R$ can also be
computed from the $P_k$. The variance is

$$\langle R^2 \rangle - \langle R \rangle^2 = \sum_{k} P_k (1 - P_k) \approx \frac{N}{\ell - 1} \left( \frac{\ell + 1}{\ell - 1} \ln \ell - 2 \right) \tag{7}$$

for large $N$, which decreases with increasing $\ell$. Thus asymptotically $R$
is a normal RV with fluctuations of order $\sqrt{N}$. In addition, analytic
results for the spacings between records are reported in [13].
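The record probabilities of this model can be evaluated exactly from the shell
sizes, since the symmetry argument gives $P_k$ as the ratio of the number of
newly available sequences to the number of all sequences within distance $k$.
The following sketch is an illustration under that assumption; the function
names are hypothetical, and the closed forms coded in `asymptotic_mean` and
`asymptotic_variance` are the large-$N$ expressions (6) and (7). At finite $N$
the exact sums include contributions from $k = 0$ and from shells near and
beyond $k_{\max}$, so only approximate agreement should be expected.

```python
from math import comb, log


def record_probabilities(N, ell):
    """P_k = alpha_k / sum_{j<=k} alpha_j for k = 0..N, with alpha_k from eq. (3)."""
    probs, cumulative = [], 0
    for k in range(N + 1):
        alpha_k = comb(N, k) * (ell - 1) ** k
        cumulative += alpha_k
        probs.append(alpha_k / cumulative)
    return probs


def record_moments(N, ell):
    """Exact mean and variance of the number of records for independent record events."""
    probs = record_probabilities(N, ell)
    mean = sum(probs)
    variance = sum(p * (1.0 - p) for p in probs)
    return mean, variance


def asymptotic_mean(N, ell):
    """Large-N mean number of records, eq. (6)."""
    return (1.0 - log(ell) / (ell - 1)) * N


def asymptotic_variance(N, ell):
    """Large-N variance of the number of records, eq. (7)."""
    return N / (ell - 1) * ((ell + 1) / (ell - 1) * log(ell) - 2.0)


if __name__ == "__main__":
    for ell in (2, 4):
        mean, variance = record_moments(1000, ell)
        print(ell, mean, asymptotic_mean(1000, ell),
              variance, asymptotic_variance(1000, ell))
```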



2 This generalization was originally introduced to investigate whether the frequent 
breaking of records in the Olympic games can be attributed to the fact that the 
athletes are selected from exponentially growing populations [21]. The conclusion 
was that population growth is not sufficient to explain the data. 
4 Quasispecies evolution in the strong selection limit

For a somewhat more realistic description of the population dynamics, we turn
to Eigen's quasispecies model [22,23], arguably the simplest mathematical model
that implements the basic mechanisms of selection and mutation for a
genetically heterogeneous population on the level of the sequence space [14].
The model was introduced to describe the population dynamics of asexually
reproducing entities like self-replicating macromolecules. It can be applied
whenever the population size is large, so that the number of individuals
occupying a given site in sequence space can be represented by a continuous
variable. Because of the exponential proliferation of the number of sequences
with increasing $N$, real populations are very sparse in sequence space, which
severely limits the applicability of a continuum description. We nevertheless
believe that it is important to first understand the long time dynamics of
sequence space evolution in the continuum setting, before taking into account
the effects of the discreteness of real populations.

In the quasispecies model, the population $Z(\sigma, t)$ of genotype $\sigma$
at time $t$ evolves in discrete time according to the linear recursion relation

$$Z(\sigma, t+1) = \sum_{\sigma'} p(\sigma' \to \sigma)\, W(\sigma')\, Z(\sigma', t), \tag{8}$$

where $p(\sigma' \to \sigma)$ is the mutation probability that sequence
$\sigma$ appears as offspring of sequence $\sigma'$. Assuming that single point
mutations occur with probability $\mu$ per generation, the mutation probability
takes the form

$$p(\sigma' \to \sigma) = \mu^{d(\sigma, \sigma')} (1 - \mu)^{N - d(\sigma, \sigma')}. \tag{9}$$

Consider a population that is initially localized at a seed sequence
$\sigma_0$, i.e., the initial condition for (8) is
$Z(\sigma, 0) = Z_0 \delta_{\sigma, \sigma_0}$. Then after one time step we
have

$$Z(\sigma, 1) = Z_0 W(\sigma_0) (1 - \mu)^N [\mu/(1 - \mu)]^{d(\sigma, \sigma_0)} \sim \exp[-d(\sigma, \sigma_0)/\lambda]. \tag{10}$$

The population density is now nonzero everywhere, with a magnitude decaying
exponentially with increasing distance from the seed sequence, where the decay
length is $\lambda = 1/\ln[1/\mu - 1]$. At this point individuals with
genotypes far away from the seed start to compete with the majority of the
population still located at $\sigma_0$. To quantify this competition, we follow
the location of the current leader $\sigma^*(t)$, which is defined as the
sequence at which $Z(\sigma, t)$ is maximal. The path of $\sigma^*(t)$
describes an evolutionary trajectory in sequence space [11,12]. Along such a
trajectory the fitness $F(\sigma^*)$ increases in a stepwise fashion, similar
to the fitness trajectories observed in experimental studies of microbial
populations [24,25].
The dynamics of evolutionary trajectories is simple in a strong selection limit
modeled after the zero temperature limit of the statistical physics of
disordered systems [11,12]. Writing $\mu = e^{-\beta \gamma}$ and taking the
inverse selective temperature $\beta \to \infty$, one obtains a recursion
relation for the logarithmic population variable $E(\sigma, t)$ defined by
$Z(\sigma, t) = e^{\beta E(\sigma, t)}$. As was shown in [12], the behavior
remains essentially unchanged if the mutational part of the dynamics is turned
off after the first time step. This implies that for $t > 2$ the population at
each site $\sigma$ grows independently, at its own logarithmic rate
$F(\sigma)$, according to

$$E(\sigma, t) = E(\sigma, 1) + F(\sigma)(t - 1) = F(\sigma_0) - \gamma\, d(\sigma, \sigma_0) + F(\sigma)(t - 1). \tag{11}$$



Here the initial condition (10) has been inserted. Equation (11) is a
particularly transparent representation of the evolutionary race. Each genotype
advances at its own speed $F(\sigma)$, from an initial position determined by
its distance from the seed sequence $\sigma_0$. In the course of time, the
leadership in the population changes from sequences with relatively low fitness
located close to $\sigma_0$ to more distant sequences of larger fitness, until
eventually the globally fittest sequence is reached and the race comes to an
end. At any given time the current leader $\sigma^*(t)$ satisfies
$E(\sigma^*(t), t) = \max_{\sigma} \{E(\sigma, t)\}$; that is,
$E(\sigma^*(t), t)$ is the upper envelope of the family of straight lines
defined by (11), and leadership changes correspond to the corners of the
envelope. The leadership changes are precisely the jumps in the punctuated
evolutionary trajectory, and their statistics will be discussed in the next
section.
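The race described by (11) can be illustrated by following the leader at
integer times. The sketch below is a brute-force illustration, not the
procedure used in the cited references. It assumes that each Hamming shell $k$
is represented by a single fitness value (the largest one in the shell), drops
the constant $F(\sigma_0)$ that is common to all lines, and uses hypothetical
names (`log_population`, `leader`, `trajectory`) and made-up fitness values.

```python
def log_population(k, fitness, t, gamma=1.0):
    """E(k, t) = -gamma * k + F_k * (t - 1): eq. (11) without the common constant."""
    return -gamma * k + fitness * (t - 1)


def leader(shell_fitness, t, gamma=1.0):
    """Shell with the largest E at time t; ties go to the shell closest to the seed."""
    return max(
        range(len(shell_fitness)),
        key=lambda k: (log_population(k, shell_fitness[k], t, gamma), -k),
    )


def trajectory(shell_fitness, t_max, gamma=1.0):
    """(time, shell) pairs at which the leader seen at integer times changes."""
    path = []
    for t in range(1, t_max + 1):
        k = leader(shell_fitness, t, gamma)
        if not path or path[-1][1] != k:
            path.append((t, k))
    return path


if __name__ == "__main__":
    # Illustrative shell fitnesses; shell 0 is the seed itself.
    print(trajectory([1.0, 0.5, 2.0, 2.5, 4.0], t_max=10))  # [(1, 0), (3, 4)]
```

In this example the shells at distance 2 and 3 are fitter than the seed, yet
neither of them ever leads: the shell at distance 4 overtakes the seed first.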



5 Bypassing 

Several properties of evolutionary trajectories follow immediately from the
representation (11). First, since all sequences within a shell of constant
distance $d(\sigma, \sigma_0)$ start with the same population at $t = 1$, only
the sequence with the largest fitness within each shell has a chance of ever
attaining the leadership. Second, in order to become the new leader, the
fitness of a sequence has to exceed that of the current leader, i.e. the
sequence has to be a record in the sense of Sect. 3. Thus, among the $\ell^N$
available genotypes only a small fraction given by the mean number of records
(6) is eligible to become part of the evolutionary trajectory.

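However, not every record will become a leader. To see this, suppose the
current leader is at $\sigma$, and let $\sigma'$ be a subsequent record with
$F(\sigma') > F(\sigma)$ and $d(\sigma_0, \sigma') > d(\sigma_0, \sigma)$. Then
according to (11), $E(\sigma, t)$ and $E(\sigma', t)$ will cross at time
$t = 1 + T(\sigma, \sigma')$, with

$$T(\sigma, \sigma') = \gamma\, \frac{d(\sigma_0, \sigma') - d(\sigma_0, \sigma)}{F(\sigma') - F(\sigma)}. \tag{12}$$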
Fig. 1. Simulation data for the probability $Q_k$ for an evolutionary jump to
occur at distance $k$ from the seed site. Main figure shows data for Gaussian
(○, △) and exponential (+, ×) fitness distributions, two different sequence
lengths $N$, and alphabet size $\ell = 2$, on a double logarithmic scale. Inset
shows data for exponential fitness distribution, $N = 512$, and three different
values of $\ell$ on a linear scale. The data were averaged over $10^5$ (main
figure) and $10^4$ (inset) disorder configurations, respectively.

The leadership will be taken over by the sequence $\sigma'$ that minimizes the
crossing time (12), which does not need to be the next record in line. We say
that a sequence $\sigma_1$ is bypassed by a sequence $\sigma_2$ with
$d(\sigma_0, \sigma) < d(\sigma_0, \sigma_1) < d(\sigma_0, \sigma_2)$, if
$T(\sigma, \sigma_2) < T(\sigma, \sigma_1)$. Because of bypassing, the number
of records (6) is only an upper bound on the number of leadership changes.
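The selection of successive leaders by minimal crossing time can be sketched
as follows. This is an illustration under stated assumptions rather than the
algorithm of the cited references: each shell $k$ is represented by its
largest fitness, the crossing time is the ratio of the distance difference
(times $\gamma$) to the fitness difference as implied by (11), and the names
`crossing_time`, `records` and `leaders` as well as the fitness values are
hypothetical.

```python
def crossing_time(k_from, k_to, shell_fitness, gamma=1.0):
    """Time (measured from t = 1) at which shell k_to overtakes shell k_from."""
    return gamma * (k_to - k_from) / (shell_fitness[k_to] - shell_fitness[k_from])


def records(shell_fitness):
    """Shells whose fitness exceeds that of every shell closer to the seed."""
    best, out = float("-inf"), []
    for k, f in enumerate(shell_fitness):
        if f > best:
            best = f
            out.append(k)
    return out


def leaders(shell_fitness, gamma=1.0):
    """Successive leaders, starting from the seed (shell 0)."""
    path = [0]
    while True:
        current = path[-1]
        candidates = [
            k for k in range(current + 1, len(shell_fitness))
            if shell_fitness[k] > shell_fitness[current]
        ]
        if not candidates:
            return path
        # The smallest crossing time wins; on a tie the fitter shell takes over.
        path.append(min(
            candidates,
            key=lambda k: (crossing_time(current, k, shell_fitness, gamma),
                           -shell_fitness[k]),
        ))


if __name__ == "__main__":
    fitness = [1.0, 3.0, 2.0, 3.2, 3.5]  # illustrative values only
    print(records(fitness))  # [0, 1, 3, 4]
    print(leaders(fitness))  # [0, 1, 4]
```

Here the shell at distance 3 is a record but never becomes a leader, because
the shell at distance 4 overtakes the current leader earlier.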

In contrast to the properties of the records discussed in Sects. 1 and 3, which
are independent of the underlying fitness distribution, the prevalence of
bypassing depends on $p(F)$ [13]. We can get some insight into the behavior by
estimating the typical time $T^* = T(\sigma^{(f-1)}, \sigma^{(f)})$ at which
the penultimate leader $\sigma^{(f-1)}$ is overtaken by the sequence
$\sigma^{(f)}$ with globally maximal fitness [12]. As both $\sigma^{(f)}$ and
$\sigma^{(f-1)}$ are expected to reside within a belt of thickness $\sqrt{N}$
around $k_{\max}$, we have
$d(\sigma_0, \sigma^{(f)}) - d(\sigma_0, \sigma^{(f-1)}) \sim \sqrt{N}$. The
fitness difference $F(\sigma^{(f)}) - F(\sigma^{(f-1)})$ should be of the order
of the fitness gap of the landscape. For example, for a fitness distribution
with a power law tail $p(F) \sim F^{-(\mu + 1)}$, we have according to (4) that
$F_{\max} \sim S^{1/\mu} = \ell^{N/\mu}$, and the fitness gap is of the same
order. This implies that the crossing time
$T^* \sim \sqrt{N}\, \ell^{-N/\mu}$ decreases$^3$ with increasing $N$; for
large $N$ all intermediate records are bypassed, and the globally fittest
sequence immediately takes over the leadership.



3 Due to the rare occurrence of landscapes with a very small fitness gap, the
mean crossing time is nevertheless infinite: The distribution of $T^*$ has a
universal $1/(T^*)^2$ tail with a prefactor that vanishes for $N \to \infty$
for power law fitness distributions [12].
Nontrivial behavior is found when the fitness gap decreases with $N$, or
increases more slowly than $\sqrt{N}$. In Fig. 1 we show numerical data for
Gaussian and exponential fitness distributions, for which the gap is of order
unity independent of $N$. Very large sequence lengths can be treated by using
the shell fitness $F_k$, which is the largest of $\alpha_k$ i.i.d. RV's [12];
in this way the number of RV's that are needed for each realization reduces
from $\ell^N$ to $N$. The generation of the $F_k$ is feasible despite the
astronomically large values of $\alpha_k$ because the maximum of $\alpha_k$
exponential or Gaussian RV's is only of order $\ln \alpha_k$ or
$\sqrt{\ln \alpha_k}$, respectively [compare to (4)]. The key result
illustrated in Fig. 1 is the scaling form

$$Q_k \approx N^{-1/2} f(k/N) \tag{13}$$

for the probability $Q_k$ for an evolutionary jump to occur at distance $k$
from the seed. The total number of jumps is of order $\sqrt{N}$, and hence most
of the $O(N)$ records are bypassed. The scaling function $f(x)$ is cut off at
$k_{\max}/N = 1 - 1/\ell$, but its shape appears to be independent of the
alphabet size $\ell$ (inset of Fig. 1). For both Gaussian and exponential
fitness distributions, the behavior of the scaling function at small arguments
is close to $f(x) \sim x^{-1/2}$. This behavior would imply that
$Q_k \sim 1/\sqrt{k}$ independent of $N$ for $k \ll N$, and that the number of
jumps grows as $\sqrt{k}$ with increasing distance from the seed site. We
expect these results to be generally valid for fitness distributions in the
Gumbel universality class of extreme value theory [20]. The case of bounded
fitness distributions should also be interesting, but has not been treated so
far because of the difficulty in creating the shell fitnesses for large $N$.
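One possible way to draw shell fitnesses for very large $\alpha_k$ is inverse
transform sampling carried out in logarithmic variables. The sketch below is
an assumption about how this could be done and is not claimed to be the
procedure of [12]. It treats only the exponential distribution $p(F) = e^{-F}$,
for which the maximum of $\alpha$ i.i.d. variables has the cumulative
distribution $(1 - e^{-F})^{\alpha}$, and it works with $\ln \alpha_k$
(obtained from `math.lgamma`) so that $\alpha_k$ itself is never formed. All
function names are hypothetical.

```python
import math
import random


def log_shell_size(N, k, ell):
    """ln(alpha_k) with alpha_k = binom(N, k) * (ell - 1)**k, via log-gamma."""
    return (math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
            + k * math.log(ell - 1))


def shell_max_exponential(u, log_alpha):
    """Map a uniform variate u in (0, 1) to the maximum of alpha Exp(1) variables.

    Inverting (1 - exp(-F))**alpha = u gives F = -ln(1 - u**(1/alpha)).
    """
    log_y = math.log(-math.log(u)) - log_alpha  # y = -ln(u) / alpha
    if log_y < -40.0:
        # Here 1 - exp(-y) equals y to double precision, so F = -ln(y).
        return -log_y
    return -math.log(-math.expm1(-math.exp(log_y)))


def sample_shell_fitness(N, ell, rng):
    """One realization of the shell fitnesses F_0, ..., F_N."""
    out = []
    for k in range(N + 1):
        u = rng.random()
        while u == 0.0:  # keep u strictly inside (0, 1)
            u = rng.random()
        out.append(shell_max_exponential(u, log_shell_size(N, k, ell)))
    return out


if __name__ == "__main__":
    fitness = sample_shell_fitness(512, 2, random.Random(1))
    print(len(fitness), max(fitness))
```

A realization produced in this way contains $N + 1$ numbers of order
$\ln \alpha_k$, in line with the estimate quoted above for exponential RV's.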

An analytic understanding of (13) is lacking at present, and must be left to
future work. In fact, as is explained in detail in [13], the statistics of
bypassing is difficult to handle analytically even for the simple case when the
geometry of sequence space is ignored and the shell fitnesses are replaced by
i.i.d. RV's. It is remarkable that the innocuous generalization of the basic
record model, defined by the family (11) of lines with random slopes, leads to
a rather involved and rich probabilistic problem.